2024
A lesser-known (today; once a bestseller) book by Darrell Huff (142 pages, A5 format)
BTW: this photo (taken in 2015), coupled with the fact that Gates funded epidemiology research at Johns Hopkins University, has become “evidence” for various morons (of which there are plenty in the USA) that Gates was behind the COVID-19 pandemic
A book written by Darrell Huff in 1954 presenting an introduction to statistics for the general reader. Not a statistician, Huff was a journalist […]
In the 1960/1970s, it became a standard textbook introduction to the subject of statistics for many college students […] one of the best-selling statistics books in history.
https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
The book consists of 10 chapters and is written in a provocative (unscientific) way. Individual chapters are so well known that entering a chapter title into Google returns hundreds of thousands of references.
ch1: The Sample with the Built-in Bias (ie it is very difficult to draw unbiased/perfect random sample)
ch2: The Well-Chosen Average. You can manipulate average value in various ways: using various averages/using different definitions of averaged units/measuring in various ways
ch3: The Little Figures That Are Not There (Figures = Details). In short: reporting results without context or important information.
ch4: Much Ado about Practically Nothing (continues ch3). Insignificant results: a difference that has no practical meaning.
ch5: The Gee-Whiz Graphs (Statistical graphs in cartesian coordinates with OY axis not starting from zero) https://en.wikipedia.org/wiki/Gee_Whiz → https://en.wikipedia.org/wiki/Misleading_graph
ch6: The One-Dimensional Picture (comparing 1D quantities using 2D or pseudo-3D) https://thejeshgn.com/2017/11/17/how-to-lie-with-graphs/
ch7: The Semiattached Figure. Using one thing as a way to claim proof of something else, even though there’s no correlation between the two (not attached) https://www.secjuice.com/the-semi-attached-figure/
ch8: Post Hoc Rides Again (Correlation is not causation)
ch9: How to Statisticulate. Misinforming people by the use of statistical material might be called statistical manipulation; in a word, statisticulation. (Summary of ch1–ch8)
ch10: How to Talk Back to a Statistic (How not to be deceived)
Who Says So? (interested parties can be unreliable; car seller reputation is poor);
How Does He Know? (measurement is often unreliable);
What’s Missing? (incomplete analysis signals bias);
Many figures lose meaning because a comparison is missing. In Poland there was a public discussion about falling fertility: women in Poland supposedly do not give birth, since the average age of a mother at the birth of her first child is 27 years. [But this is the norm across the whole of Europe.]
Did Somebody Change The Subject? (beware of the Semiattached Figure)
Does It Make Sense? (forget about statistics and think about common sense)
Despite its mathematical base, statistics is as much an art as it is a science (Huff p. 120)
Unfortunately, quite the opposite…
Misleading statistical analyses are still doing quite well, if not better than in Huff’s times, which is probably due to the following factors:
the number of people doing statistics, often amateurs, has increased exponentially (everyone can easily count something today)
the amount of readily available data has increased exponentially too
Not only charts can be misleading (intentionally or not), but this lecture is about charts. Because charts are ubiquitous. Because statistical charts have become a favorite way for the media, including electronic and social media, to present results (we are inundated with charts that aim to prove something).
Statistical charts can be created for the following three purposes:
Decorative (To attract someone’s attention; a document without images is dull, colorful pictures are better than black-and-white ones; fancy drawings are better than simple ones. Form is king; content does not matter.)
Explanatory (To better explain a certain phenomenon to someone. It is often said that a picture is worth a thousand words.)
Exploratory (To identify data patterns during the exploratory/preliminary stage of data analysis.)
I will focus on the second point, i.e., on effective graphical methods for explaining relationships in data. One graphical method is more effective than another if the information it contains can be interpreted more efficiently or easily by the audience [Robbins 2005].
Some charts are better than others:
Recommended charts:
Strip charts, Bar charts, Line charts, Histograms, Scatter plots, Panels (instead of stacked bar charts or multi-line charts)
Not Recommended:
Pie charts, Bubble charts, Stacked bar charts, Multi-line charts
Bar charts, line charts, and pie charts were invented by William Playfair (an economist!) in the late 18th and early 19th century. Dot plots were created by William Cleveland in the 1980s. Box plots were introduced by John Tukey in the 1970s.
More of Playfair’s charts can be found via Google or in [Symanzik’s paper](http://www.math.usu.edu/symanzik/papers/2009_cost/editorial.html)
Florence Nightingale also worked with statistics. The chart below is called the Nightingale Rose. It is a type of stacked bar chart, but in a polar coordinate system
There are twelve sectors (polar bars) — one for each month.
The length of the radius, and thus the area of the sector, depends on the magnitude of the phenomenon it represents (the number of deaths due to: wounds, diseases, and other causes).
FN diagrams (Nightingale’s diagrams) didn’t catch on, but not every new idea is instantly brilliant…
Data visualization involves encoding relationships between numbers (quantitative information) using graphic metaphors (e.g., geometric shapes, angles, colors, positions, etc.).
Some metaphors are more effective than others in terms of clarity and accuracy.
According to William S. Cleveland (known for stripcharts) and Robert McGill in their seminal paper Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods (JASA, 1984), graphic metaphors can be ranked by effectiveness as follows: Position along a common scale ➔ Position along identical, nonaligned scale ➔ Length ➔ Slope or direction/Angle ➔ Area ➔ Volume (pseudo-3D graphics) ➔ Color (hue, saturation, or black density)
Key observations:
Position is the most effective: judging distances along a common scale is precise and intuitive for viewers.
Angles are less effective: humans struggle to compare angles accurately, especially when differences are small. Acute angles tend to be underestimated, while obtuse angles (greater than 90°) are overestimated.
Area comparisons are imprecise: differentiating between objects of similar areas is highly challenging.
Color has low effectiveness: while visually appealing, colors (whether hue, saturation, or density) are poor for conveying precise quantitative differences.
These findings highlight why simple, position-based visuals like bar charts and scatter plots outperform complex visuals like pie charts or bubble charts.
The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)
NUTS standard was revised several times (on the average every 4 years :-)), so there is even a page at ec.europa.eu domain dedicated to NUTS (short) history (cf NUTS history)
NUTS1 (level) – macroregion, NUTS2 – state, NUTS3 – subregion (several counties in case of Poland)
Poland is divided into 7 macroregions, 16 states (NUTS2), and 72 subregions (NUTS3).
NUTS1 level is only for statistical purposes (but regions are in fact distinct due to history, economics, natural-conditions, cultural factors etc… )
There is a relevant and interesting page by GUS (Główny Urząd Statystyczny, the Central Statistical Office of Poland), unfortunately only in Polish (use Google Translate :-) or mail me if you are interested) (cf Klasyfikacja NUTS w Polsce)
The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW “province” in Polish is “prowincja” (both words come from Latin), but a Polish administrative province is actually called “województwo”, from “wodzić”, i.e. to command (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces, each ruled by a “wojewoda”, i.e. the chief of that province. More can be found at Wikipedia (cf Administrative divisions of Poland)
NUTS3 consists of 380 counties grouped into 72 subregions.
A Polish county (called “powiat”) is a second-level administrative unit.
In old Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so a “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, as 600 years ago :-)
The 3rd level administrative unit is called gmina (municipality).
There are (approximately) 380 counties and 2750 municipalities in Poland.
As the population of Poland is 38.5 million and its area is 312.7 thousand sq km (roughly 120 persons per sq km), on average each powiat covers about 820 sq km and each municipality about 114 sq km; that is approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.
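The averages quoted above can be re-derived in a few lines. This is only a sanity-check sketch using the rounded figures from the text (38.5 million people, 312.7 thousand sq km, about 380 counties and 2750 municipalities); the variable names are illustrative.

```python
# Rounded figures quoted in the text (not official statistics).
POPULATION = 38.5e6   # persons
AREA_KM2 = 312.7e3    # square kilometres
N_POWIATS = 380       # approximate number of counties
N_GMINAS = 2750       # approximate number of municipalities

density = POPULATION / AREA_KM2          # persons per sq km
area_per_powiat = AREA_KM2 / N_POWIATS   # sq km
area_per_gmina = AREA_KM2 / N_GMINAS     # sq km
pop_per_powiat = POPULATION / N_POWIATS  # persons
pop_per_gmina = POPULATION / N_GMINAS    # persons

print(f"density:         {density:.0f} persons per sq km")
print(f"area per powiat: {area_per_powiat:.0f} sq km")
print(f"area per gmina:  {area_per_gmina:.1f} sq km")
print(f"pop. per powiat: {pop_per_powiat / 1000:.0f} thousand")
print(f"pop. per gmina:  {pop_per_gmina / 1000:.0f} thousand")
```

The results agree with the text up to rounding.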
TERYT is the Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes identification of administrative units. Every unit has an id number of up to 7 digits: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and t encodes the type of municipality (rural, municipal or mixed). Higher-level units have trailing zeros for the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a municipality (approx. 2750 units).
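A minimal sketch of how such an id could be split into its parts, assuming only the wwppggt layout and the trailing-zero convention described above. The names `TerytId`, `parse_teryt` and `level` are hypothetical helpers, not part of any official TERYT tooling, and the meaning of the individual type digits is deliberately left uninterpreted.

```python
from typing import NamedTuple


class TerytId(NamedTuple):
    wojewodztwo: str  # ww
    powiat: str       # pp ("00" when the id refers to a whole województwo)
    gmina: str        # gg ("00" when the id refers to a whole powiat)
    type_digit: str   # t  (type of municipality; "0" when not applicable)


def parse_teryt(code: str) -> TerytId:
    """Pad a short id with trailing zeros to 7 digits and split it."""
    if not code.isdigit() or not 1 <= len(code) <= 7:
        raise ValueError(f"not a valid id: {code!r}")
    padded = code.ljust(7, "0")
    return TerytId(padded[0:2], padded[2:4], padded[4:6], padded[6])


def level(unit: TerytId) -> str:
    """Administrative level implied by the zero-filled parts."""
    if unit.powiat == "00":
        return "województwo"
    if unit.gmina == "00":
        return "powiat"
    return "gmina"


print(parse_teryt("14"), level(parse_teryt("14")))
print(parse_teryt("1205"), level(parse_teryt("1205")))
```

Ids are handled as strings so that a leading zero in the województwo part is not lost.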
So you are now experts on administrative division of Poland, and we can go back to statistical charts…
Example 1: Municipalities in Poland by type (source: Local Data Bank of the Central Statistical Office of Poland/BDL)
Bar chart
Pie chart
If there are few values, a pie chart is fine, but why visualize just three numbers?
Example 2: Land Use in Poland as a Percentage of Total Area (source: BDL)
In this example, the variable takes on more values, which immediately demonstrates the weaknesses of the pie chart.
https://bdl.stat.gov.pl/bdlarch/metadane/podgrupy/441?back=True agricultural land | forests | lands under water
Bar Chart
Pie chart:
Example 3: Nights spent at tourist accommodation establishments by non-residents, by EU country (2017). Source: Eurostat tour_occ_ninat
Pie charts:
Bar charts:
A histogram is a graphical representation of the distribution of a dataset. It shows how frequently each value (or range of values) occurs within the dataset.
Example: The age of Nobel Prize laureates (up to 2018); Source: The Nobel Prize API Developer Hub
Histograms with a bin (interval) width of 10, 5, 2 and 1 years:
The more values/intervals, the more detailed the histogram becomes, which is not necessarily desirable because it can obscure the overall picture.
There is no “golden rule” for how many intervals there should be, as their number determines the shape and the optical size (i.e., the total area) of the histogram.
The fewer the intervals, the larger the histogram will appear optically.
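The effect of the bin width can be sketched without any plotting library. The ages below are made-up illustrative values (not the Nobel data), and `bin_counts` is a hypothetical helper that simply counts values per interval.

```python
from collections import Counter


def bin_counts(values, width, origin=0):
    """Count values per interval [start, start + width); keys are interval starts."""
    counts = Counter((v - origin) // width for v in values)
    return {origin + k * width: counts[k] for k in sorted(counts)}


# Made-up ages, for illustration only.
ages = [25, 31, 35, 42, 45, 47, 51, 54, 55, 58,
        60, 61, 63, 66, 67, 70, 72, 75, 79, 88]

for width in (10, 5, 2):
    print(f"bin width = {width}")
    for start, n in bin_counts(ages, width).items():
        print(f"  {start:3d}-{start + width - 1:3d} {'#' * n}")
```

With a width of 10 the overall shape is visible at a glance; with a width of 2 most bins hold a single value and the picture dissolves into noise.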
stacked barchart vs grouped barchart
Grouped bar chart
Stacked bar chart
or
Comparison of three provinces reveals the limitations of pie charts (with more numbers, the pie chart becomes unreadable/ineffective).
Land use in selected provinces
Still insist on using pie charts? 😊
Another example
CBOS (leading Polish government-funded research institute focused on public opinion polling) conducts the survey “Current Problems and Events” at least 12 times a year, on a representative sample of approximately 1,000 adult residents of Poland. (cf https://www.cbos.pl/PL/trendy/trendy.php?)
In this research trust in politicians is measured. This trust is assessed through a single question, which reads as follows:
Public figures—through their actions, what they say, and their goals—evoke varying degrees of trust. We will now present you with a list of individuals active in the political life of our country. For each of them, please indicate the extent to which they inspire your trust. When responding, please use a scale where -5 means that you have deep distrust for the person, 0 means that you are indifferent toward them, and +5 means that you have full trust in them. Of course, you may also use other points on the scale. If you are not familiar with someone, please let us know.
The percentages of respondents expressing trust correspond to ratings from +1 to +5, distrust corresponds to ratings from -1 to -5, and indifference is represented by a rating of 0.
In its summaries, CBOS excludes responses of “difficult to say” (indifference) and refusals to answer.
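The recoding of the 11-point scale into three categories can be sketched as follows. The function names and the sample answers are hypothetical; representing “difficult to say” and refusals as `None` and dropping them before computing percentages is an assumption about how the exclusion works, not CBOS’s documented procedure.

```python
from collections import Counter


def classify(rating: int) -> str:
    """Map a rating on the -5..+5 scale to trust / indifference / distrust."""
    if not -5 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    if rating > 0:
        return "trust"
    if rating < 0:
        return "distrust"
    return "indifference"


def shares(ratings):
    """Percentage of each category among the answers that are not None."""
    answered = [r for r in ratings if r is not None]
    counts = Counter(classify(r) for r in answered)
    return {k: 100 * counts[k] / len(answered)
            for k in ("trust", "indifference", "distrust")}


# Made-up answers for one politician; None = "difficult to say" or refusal.
sample = [5, 3, 0, -2, -5, 1, 0, -1, 4, 2, None, None]
print(shares(sample))
```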
Stacked bar chart
Panel of barcharts
Panel of piecharts (to convince those who remain unconvinced)
Box plots are much better than histograms for comparing distributions.
Construction of a (typical) box plot:
The middle line represents the median.
The bottom/top of the rectangle indicates the first/third quartile (\(Q_1\)/\(Q_3\)).
The height of the rectangle is the interquartile range (\(IQR = Q_3 - Q_1\)).
The fancy lines above/below the rectangle, called whiskers (a cat has whiskers, while a person has a mustache), extend to \(Q_3 + 1.5 \times IQR\) and \(Q_1 - 1.5 \times IQR\) respectively.
Symbols above/below the whiskers (usually open circles) represent outliers.
Notice the trick: outliers are not defined as (for example) the upper/lower 1% of all values (because then every distribution would have outliers); rather, they are values smaller than \(Q_1 - 1.5 \times IQR\) or larger than \(Q_3 + 1.5 \times IQR\).
All values in distributions with moderate variability fit within such a range.
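The numbers behind a box plot can be computed with the standard library. This is a sketch under one assumption: quartiles are taken with the “inclusive” method of `statistics.quantiles`; plotting packages use slightly different quartile conventions, so their boxes may differ a little. The sample values are made up.

```python
import statistics


def box_stats(values):
    """Quartiles, IQR, the 1.5 x IQR limits and the values outside them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper, "outliers": outliers}


# Made-up sample with one extreme value.
print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Only the value 100 falls outside the limits; a sample with moderate spread would produce an empty outlier list.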
Example: the age of Nobel Prize laureates.
A strip chart represents the distribution of values along an axis. Such a plot can be used as an alternative to a box plot (because it retains more information about the data).
Example: the age of Nobel Prize laureates.
A serious problem with a strip plot is overlapping points.
There is no perfect solution to this problem, but several techniques can help: use smaller dots, use semi-transparent dots (right panel), or apply jitter.
Jitter is a small random noise added to the data (below; larger jitter in the right panel).
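Jitter can be sketched in a few lines. The helper name `jitter`, the uniform noise and the fixed seed are illustrative choices (plotting packages have their own defaults); the ages are made up.

```python
import random


def jitter(values, amount, seed=None):
    """Add uniform random noise from [-amount, +amount] to every value."""
    rng = random.Random(seed)  # private generator, reproducible with a seed
    return [v + rng.uniform(-amount, amount) for v in values]


# Three laureates aged exactly 50 would be drawn on top of each other.
ages = [50, 50, 50, 61]
print(jitter(ages, amount=0.3, seed=1))
```

In a strip chart the noise is usually added along the axis that carries no data, so the plotted values themselves are not distorted.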
Purpose: to show relationships between two (or more–bubble plot) numeric variables…
Example: GDP versus CO2 emissions
https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=chart Carbon dioxide (CO2) emissions (total) excluding LULUCF (Mt CO2e)
Unit of measure: Mt CO2e, which I believe means million metric tonnes of CO2 equivalent
GDP vs CO2 emissions cont.
Line plot or bar plot
Purpose:
to show the rate of change; how quickly it increases/decreases, and
to compare dynamics, that is, to assess the changes of one variable relative to another.
Example: Per capita emissions of CO2 equivalents (metric tonnes)
Line plot
Barchart
Stacked barchart
Panel
BTW: there is a problem in EN.GHG.CO2.MT.CE.AR5 description at
https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=chart
https://data.worldbank.org/indicator/EN.GHG.CO2.MT.CE.AR5?end=2022&start=1970&view=map
Climate disaster utterly important, yeah?
Understandable Content. Ensure the reader clearly understands what the chart represents. Include axis labels, scale descriptions, and necessary explanations.
Clear Form. The reader must easily see the presented information. Avoid tangled lines, overlapping elements, or clutter.
Emphasize Data. Highlight the data, not unnecessary elements like grid lines, redundant legends, or meaningless arrows. Keep the design simple.
Axis and Labeling Guidelines. Place tick marks and axis labels externally to avoid clutter. X-axis values should increase from left to right, and Y-axis values from bottom to top—never the reverse. Use a reasonable number of axis labels to avoid overcrowding.
Accessibility and Scalability. Design for readability in black-and-white mode or when scaled down (e.g., photocopied or viewed on a smartphone).
Avoid Overcomplication. Use only as many visual metaphors as there are data dimensions. Bar chart rectangles should be uniform in color. 3D charts are a disaster. They add complexity without improving clarity.
Optimize Baselines and Proportions.
Use a shared baseline when possible, especially for comparison purposes.
For line charts, aim for a 45° slope for optimal proportions.
Use logarithmic scales for large data ranges but avoid truncated scales unless necessary.
Start axes at 0 unless a specific exception justifies otherwise.
Avoid dual axes, as they complicate interpretation.
Prefer Labels Over Legends.
Labels placed directly on the chart are preferable to legends.
Use a legend only when space constraints prevent the use of labels.
Avoid Multi-Line Charts.
Multi-line charts are generally problematic due to: – Multiple scales; – Visual clutter; – Difficulty in assessing differences between lines.
By adhering to these principles, your charts will be more effective, clear, and easier to interpret.
Edward Tufte, a renowned expert in data visualization, proposed two key principles to enhance the clarity and integrity of visualizations:
Data-to-ink ratio. Definition: the proportion of “ink” (visual elements) dedicated to representing the data versus all the ink used in the chart.
Maximize the data-to-ink ratio, meaning: Minimize decorative or non-essential elements. Focus on presenting as much data as possible in a clear and concise manner.
https://www.youtube.com/watch?v=JIMUzJzqaA8
Practical Advice:
Remove unnecessary grid lines, shading, or other embellishments. Ensure every visual element serves a purpose in communicating data.
Lie Factor (LF). Definition: the ratio of the size of an effect as shown in the graphic to the size of that effect in the data.
Ideal Value: LF should equal 100% for accurate representation. LF > 105% or < 95% signifies significant distortion of the data.
Example: Average Female Heights
The chart shows average female heights across various countries.
Latvian women appear to be 4.35 times taller than Indian women based on the chart’s visual effect (\(135/31 \approx 4.35\)).
However, the actual height difference is far smaller, roughly \(169.8/152.6 \approx 1.11\) times.
This discrepancy violates Tufte’s Lie Factor rule, as the LF is significantly higher than 105%, misleading the audience.
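As a worked check, assuming LF is computed as the ratio shown in the graphic divided by the ratio in the data:

\[
LF = \frac{135/31}{169.8/152.6} \approx \frac{4.35}{1.11} \approx 3.9 = 390\%
\]

So the chart exaggerates the difference almost fourfold.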
Maximize the Data-to-Ink Ratio: Prioritize data over decoration.
Minimize the Lie Factor: Ensure the visual effect accurately represents the actual data.
By following these rules, charts can better balance simplicity, clarity, and honesty.
This giant guy (GG) in the middle is our ex-president. The guy next to him on the left is our current president, Duda. Next to Duda is ex-rock star Kukiz, the dark horse of the elections. This is the (slightly modified) cover of an influential Polish weekly magazine from May 2015, shortly before the elections.
The figures are claimed to be in sync with recent survey results (a sort of bar chart). Can you work out from that chart the proportions of the candidates’ scores? By how much does the giant guy outperform the runner-up? Which candidate is supported by this influential magazine (easy :-)?
The lie-factor details:
The line from the shoes to the top of the head measures (at a certain print size, of course) 204 mm for GG, 134 mm for Duda and 42.5 mm for the ex-rock star. So \(204/134 \approx 1.5\) and \(204/42.5 = 4.8\). As the survey scores give \(44/29 \approx 1.5\) and \(44/9 \approx 4.9\) as well, formally the lie factor is almost perfect. But should one compare lengths or areas?
If one compares areas rather than heights, one gets significantly different (and correct) results, namely \((204 \times 58)/(134 \times 21) \approx 4.20\) and \((204 \times 58)/(42.5 \times 15) \approx 18.56\). The lie factor is then \(4.2/1.5 = 280\)% and \(18.56/4.9 \approx 379\)% respectively. A huge distortion.
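The height-based and area-based lie factors can be recomputed from the measurements quoted above. In this sketch the second number of each pair (58, 21, 15) is assumed to be the width of the figure in mm, and 44, 29 and 9 are assumed to be the survey scores; the names are illustrative. Without intermediate rounding the area-based factors come out at roughly 277% and 380%.

```python
def lie_factor(shown_ratio, data_ratio):
    """Effect shown in the graphic divided by the effect in the data."""
    return shown_ratio / data_ratio


# (height, width) in mm as measured on the cover; the widths are an assumption.
figures_mm = {"GG": (204, 58), "Duda": (134, 21), "Kukiz": (42.5, 15)}
survey = {"GG": 44, "Duda": 29, "Kukiz": 9}


def area(name):
    height, width = figures_mm[name]
    return height * width


for rival in ("Duda", "Kukiz"):
    data_ratio = survey["GG"] / survey[rival]
    height_ratio = figures_mm["GG"][0] / figures_mm[rival][0]
    area_ratio = area("GG") / area(rival)
    print(f"GG vs {rival}: data {data_ratio:.2f}, "
          f"heights {height_ratio:.2f} (LF {lie_factor(height_ratio, data_ratio):.0%}), "
          f"areas {area_ratio:.2f} (LF {lie_factor(area_ratio, data_ratio):.0%})")
```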
Moreover two more tricks were applied to boost GG. Can you see them?
BTW: the text in the pink frame claims: “figure ratios are consistent with the April-May survey outcome.” (But what exactly does “figure ratios” mean?)
The ratio between the width and the height of a rectangle is called its aspect ratio.
The aspect ratio describes the area that is occupied by the data in the chart.
A change in aspect ratio changes the perception of the graph.
The question is which aspect ratio is the best.
We recognize change most easily when the absolute slopes are close to a 45-degree angle on the graph. It is much harder to see change if the curves are nearly horizontal or vertical. The idea behind banking (Cleveland, 1988) is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at approximately 45 degrees.
Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.
Setting the aspect ratio so that the mean orientation of the line segments, weighted by the segments’ lengths, is approximately 45 degrees is called the average weighted orientation method (to 45 degrees).
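A minimal sketch of the unweighted variant (banking the average orientation to 45 degrees), not Cleveland’s original implementation. Assumptions: both axes span exactly the data range, the aspect ratio is expressed as height divided by width (take the reciprocal for width/height), and the ratio is found by simple bisection. All function names are hypothetical.

```python
import math


def normalized_slopes(xs, ys):
    """Slopes of consecutive segments after rescaling both axes to unit range."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    if x_range == 0 or y_range == 0:
        raise ValueError("need at least two distinct x values and two distinct y values")
    slopes = []
    for i in range(len(xs) - 1):
        dx = (xs[i + 1] - xs[i]) / x_range
        dy = (ys[i + 1] - ys[i]) / y_range
        if dx != 0:  # vertical segments are skipped
            slopes.append(dy / dx)
    return slopes


def mean_orientation_deg(slopes, aspect):
    """Mean absolute orientation (degrees) of the segments at a given height/width."""
    angles = [math.degrees(math.atan(aspect * abs(s))) for s in slopes]
    return sum(angles) / len(angles)


def bank_to_45(xs, ys, lo=1e-6, hi=1e6, iterations=200):
    """Height/width ratio at which the mean absolute orientation is 45 degrees."""
    slopes = normalized_slopes(xs, ys)
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)  # bisection on a logarithmic scale
        if mean_orientation_deg(slopes, mid) < 45.0:
            lo = mid  # plot too flat: make it taller
        else:
            hi = mid  # plot too steep: make it flatter
    return math.sqrt(lo * hi)


# A single peak: both segments reach 45 degrees when the plot is half as tall as wide.
print(round(bank_to_45([0, 1, 2], [0, 2, 0]), 3))
```

If half or more of the segments are perfectly flat, a 45-degree mean cannot be reached and the search simply ends at the upper bound `hi`.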
Exercise: assess which slope is the steepest and which is the flattest.
BTW: every chart presents the same data on CO2 emission (average for May each year) as provided by US Government’s Earth System Research Laboratory, Global Monitoring Division. (cf CO2 PPM - Trends in Atmospheric Carbon Dioxide)
How many Nobel Prizes have Poles received?
I asked AI
Why does even AI have problems?
https://www.youtube.com/watch?v=arKhvVWGXFo
A logarithmic scale should be used when the dataset being visualized has a large range.
As an example, let’s once again consider Nobel Prize laureates, this time by country of birth (bornCountryCode)…
Scatter plots using different scales on the Y-axis (arithmetic, log2, and log10).
Exact data:
| country | bornCountryCode | n |
|---|---|---|
| United States | US | 269 |
| United Kingdom | GB | 100 |
| Germany | DE | 82 |
| France | FR | 55 |
| Sweden | SE | 29 |
| Japan | JP | 26 |
| Russia | RU | 26 |
| Poland | PL | 25 |
| Canada | CA | 19 |
| Italy | IT | 19 |
| Netherlands | NL | 18 |
| Austria | AT | 17 |
| Switzerland | CH | 17 |
| China | CN | 12 |
| Denmark | DK | 12 |
| Norway | NO | 12 |
| Australia | AU | 10 |
| Belgium | BE | 9 |
| Hungary | HU | 9 |
| South Africa | ZA | 9 |
| India | IN | 8 |
| Spain | ES | 7 |
| Czechia | CZ | 6 |
| Egypt | EG | 6 |
| Israel | IL | 6 |
| Finland | FI | 5 |
| Ireland | IE | 5 |
| Ukraine | UA | 5 |
| Argentina | AR | 4 |
| Belarus | BY | 4 |
| Romania | RO | 4 |
| Lithuania | LT | 3 |
| Mexico | MX | 3 |
| New Zealand | NZ | 3 |
| Pakistan | PK | 3 |
| Turkey | TR | 3 |
| Bosnia & Herzegovina | BA | 2 |
| Chile | CL | 2 |
| Colombia | CO | 2 |
| Algeria | DZ | 2 |
| Guatemala | GT | 2 |
| Iran | IR | 2 |
| South Korea | KR | 2 |
| St. Lucia | LC | 2 |
| Liberia | LR | 2 |
| Luxembourg | LU | 2 |
| Portugal | PT | 2 |
| Timor-Leste | TL | 2 |
| Azerbaijan | AZ | 1 |
| Bangladesh | BD | 1 |
| Bulgaria | BG | 1 |
| Brazil | BR | 1 |
| Costa Rica | CR | 1 |
| Cyprus | CY | 1 |
| Ghana | GH | 1 |
| Guadeloupe | GP | 1 |
| Greece | GR | 1 |
| Croatia | HR | 1 |
| Indonesia | ID | 1 |
| Iceland | IS | 1 |
| Kenya | KE | 1 |
| Latvia | LV | 1 |
| Morocco | MA | 1 |
| Madagascar | MG | 1 |
| North Macedonia | MK | 1 |
| Myanmar (Burma) | MM | 1 |
| Nigeria | NG | 1 |
| Peru | PE | 1 |
| Slovenia | SI | 1 |
| Slovakia | SK | 1 |
| Taiwan | TW | 1 |
| Venezuela | VE | 1 |
| Vietnam | VN | 1 |
| Yemen | YE | 1 |
| Zimbabwe | ZW | 1 |
PL – 25 Nobel Prizes 😊 (mainly Germans and (Russian) Jews born in German/Russian Empires respectively)
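What a logarithmic axis does to these counts can be sketched numerically. The counts are taken from the table above; `relative_position` is a hypothetical helper that returns where a value would sit on an axis running from the smallest count (1) to the largest (269).

```python
import math

# A few rows from the table above.
counts = {"US": 269, "GB": 100, "DE": 82, "FR": 55, "PL": 25, "LV": 1}


def relative_position(n, n_max, log=False):
    """Position on the axis: 0 = bottom, 1 = top; the log axis starts at 1."""
    if log:
        return math.log(n) / math.log(n_max)
    return n / n_max


n_max = max(counts.values())
print("code    n   log2  log10  linear pos  log pos")
for code, n in counts.items():
    print(f"{code:4} {n:4d} {math.log2(n):6.2f} {math.log10(n):6.2f} "
          f"{relative_position(n, n_max):10.3f} "
          f"{relative_position(n, n_max, log=True):8.3f}")
```

On the arithmetic scale Poland (25) sits below a tenth of the axis height, squeezed together with most other countries; on the logarithmic scale it sits above the midpoint, so the many small values become distinguishable.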
Malbork castle, 40 km from PSW https://www.youtube.com/watch?v=PGkpg9wd3ak
A peer-reviewed paper on tourist traffic in the museum of Malbork Castle: Parzych Krzysztof, The determinants of the tourist traffic in the castle’s museum of Malbork, Journal of Education, Health and Sport.
This paper demonstrates all the textbook mistakes discussed earlier:
more readable charts (if one insists on using pie charts):
barcharts better, as usual:
What is this?
The distribution of seats in the Sejm after the 2015 elections
A frequently shown chart aimed at convincing public opinion that teachers are much worse off than before: (Average salary as a % of the overall average?)
If you start from zero, it does not look so dramatic…
The collapse of the ruble exchange rate in February/March 2022. What is very wrong with the chart?
Lecture notes/handouts and data sets are available here: https://github.com/hrpunio/Erasmus_2024_Sousse